{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Analyze a large dataset with Google BigQuery\n",
    "\n",
    "**Learning Objectives**\n",
    "\n",
    "1. Access an ecommerce dataset\n",
    "1. Look at the dataset metadata\n",
    "1. Remove duplicate entries\n",
    "1. Write and execute queries\n",
    "\n",
    "\n",
    "## Introduction \n",
    "BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage or needing a database administrator. BigQuery uses SQL and can take advantage of the pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.\n",
    "\n",
    "We have a publicly available ecommerce dataset that has millions of Google Analytics records for the Google Merchandise Store loaded into a table in BigQuery. In this lab, you use a copy of that dataset. Sample scenarios are provided, from which you look at the data and ways to remove duplicate information. The lab then steps you through further analysis the data.\n",
    "\n",
    "BigQuery can be accessed by its own browser-based interface, Google Data Studio, and many third party tools.  In this lab you will use the BigQuery directly in notebook cells using the iPython magic command `%%bigquery`.\n",
    "\n",
    "The steps you will follow in the lab are analogous to what you would do to prepare data for use in advanced ML operations. You will follow the notebook to experiment with the BigQuery queries provided to analyze the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set up the notebook environment\n",
    "\n",
    "__VERY IMPORTANT__:  In the cell below you must replace the text `<YOUR PROJECT>` with you GCP project id."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "PROJECT = \"<YOUR PROJECT>\" #TODO Replace with your project id\n",
    "\n",
    "os.environ[\"PROJECT\"] = PROJECT\n",
    "\n",
    "pd.options.display.max_columns = 50"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore eCommerce data and identify duplicate records\n",
    "\n",
    "Scenario: You were provided with Google Analytics logs for an eCommerce website in a BigQuery dataset.  The data analyst team created a new BigQuery table of all the raw eCommerce visitor session data.  This data tracks user interactions, location, device types, time on page, and details of any transaction.  Your ultimate plan is to use this data in an ML capacity to create a model that delivers highly accurate predictions of user behavior to support tailored marketing campaigns.\n",
    "\n",
    "First, a few notes on BigQuery within a python notebook context.  Any cell that starts with `%%bigquery` (the BigQuery Magic) will be interpreted as a SQL query that is executed on BigQuery, and the result is printed to our notebook.\n",
    "\n",
    "BigQuery supports [two flavors](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#comparison_of_legacy_and_standard_sql) of SQL syntax: legacy SQL and standard SQL. The preferred is standard SQL because it complies with the official SQL:2011 standard. To instruct BigQuery to interpret our syntax as such we start the query with `#standardSQL`.\n",
    "\n",
    "Our first query is accessing the BigQuery Information Schema which stores all object-related metadata.  In this case we want to see metadata details for the \"all_sessions_raw\" table. \n",
    "\n",
    "Tip: To run the current cell you can click the cell and hit **shift enter**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "#standardsql\n",
    "SELECT * \n",
    "EXCEPT \n",
    "      (table_catalog, table_schema, is_generated, generation_expression, is_stored, \n",
    "      is_updatable, is_hidden, is_system_defined, is_partitioning_column, clustering_ordinal_position)\n",
    "FROM  `data-to-insights.ecommerce.INFORMATION_SCHEMA.COLUMNS`\n",
    "WHERE table_name=\"all_sessions_raw\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next examine how many rows are in the table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT \n",
    "#standardSQL\n",
    "SELECT count(*)\n",
    "FROM `data-to-insights.ecommerce.all_sessions_raw`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now take a quick at few rows of data in the table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT \n",
    "#standardSQL\n",
    "SELECT *\n",
    "FROM `data-to-insights.ecommerce.all_sessions_raw`\n",
    "LIMIT 7"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Identify duplicate rows\n",
    "\n",
    "Seeing a sample amount of data may give you greater intuition for what is included in the dataset. But since the table is quite large, a preview is not likely to render meaningful results.  As you scan and scroll through the sample rows you see there is no singular field that uniquely identifies a row, so you need advanced logic to identify duplicate rows.\n",
    "\n",
    "The query below uses the SQL GROUP BY function on every field and counts (COUNT) where there are rows that have the same values across every field.\n",
    "\n",
    "If every field is unique, the COUNT will return 1 as there are no other groupings of rows with the exact same value for all fields.\n",
    "If there is a row with the same values for all fields, they will be grouped together and the COUNT will be greater than 1. The last part of the query is an aggregation filter using HAVING to only show the results that have a COUNT of duplicates greater than 1.\n",
    "Run the following query to find duplicate records across all columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT \n",
    "#standardSQL\n",
    "SELECT count(*) AS num_duplicate_rows, \n",
    "       * \n",
    "FROM   `data-to-insights.ecommerce.all_sessions_raw` \n",
    "GROUP BY fullvisitorid, \n",
    "          channelgrouping, \n",
    "          time, \n",
    "          country, \n",
    "          city, \n",
    "          totaltransactionrevenue, \n",
    "          transactions, \n",
    "          timeonsite, \n",
    "          pageviews, \n",
    "          sessionqualitydim, \n",
    "          date, \n",
    "          visitid, \n",
    "          type, \n",
    "          productrefundamount, \n",
    "          productquantity, \n",
    "          productprice, \n",
    "          productrevenue, \n",
    "          productsku, \n",
    "          v2productname, \n",
    "          v2productcategory, \n",
    "          productvariant, \n",
    "          currencycode, \n",
    "          itemquantity, \n",
    "          itemrevenue, \n",
    "          transactionrevenue, \n",
    "          transactionid, \n",
    "          pagetitle, \n",
    "          searchkeyword, \n",
    "          pagepathlevel1, \n",
    "          ecommerceaction_type, \n",
    "          ecommerceaction_step, \n",
    "          ecommerceaction_option \n",
    "HAVING num_duplicate_rows > 1; "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see there are quite a few \"duplicate\" records (615) when analyzed with these parameters.\n",
    "\n",
    "In your own datasets, even if you have a unique key, it is still beneficial to confirm the uniqueness of the rows with COUNT, GROUP BY, and HAVING before you begin your analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analyze the new all_sessions table\n",
    "\n",
    "In this section you use a deduplicated table called all_sessions.\n",
    "\n",
    "Scenario: Your data analyst team has provided you with a relevant query, and your schema experts have identified the key fields that must be unique for each record per your schema.\n",
    "\n",
    "Run the query to confirm that no duplicates exist, this time against the \"all_sessions\" table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "#standardSQL\n",
    "SELECT fullvisitorid, # the unique visitor ID \n",
    "       visitid, # a visitor can have multiple visits \n",
    "       date, # session date stored as string YYYYMMDD \n",
    "       time, # time of the individual site hit  (can be 0 or more) \n",
    "       v2productname, # not unique since a product can have variants like Color \n",
    "       productsku, # unique for each product \n",
    "       type, # visit and/or event trigger \n",
    "       ecommerceaction_type, # maps to ‘add to cart', ‘completed checkout' \n",
    "       ecommerceaction_step, \n",
    "       ecommerceaction_option, \n",
    "       transactionrevenue, # revenue of the order \n",
    "       transactionid, # unique identifier for revenue bearing transaction \n",
    "       count(*) AS row_count \n",
    "FROM   `data-to-insights.ecommerce.all_sessions` \n",
    "GROUP  BY 1, \n",
    "          2, \n",
    "          3, \n",
    "          4, \n",
    "          5, \n",
    "          6, \n",
    "          7, \n",
    "          8, \n",
    "          9, \n",
    "          10, \n",
    "          11, \n",
    "          12 \n",
    "HAVING row_count > 1 # find duplicates \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The query returns zero records indicating no duplicates exist."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Write basic SQL against the eCommerce data\n",
    "\n",
    "In this section, you query for insights on the ecommerce dataset.\n",
    "\n",
    "A good first path of analysis is to find the total unique visitors\n",
    "The query below determines the total views by counting product_views and the number of unique visitors by counting fullVisitorID."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "#standardSQL\n",
    "SELECT count(*)                      AS product_views, \n",
    "       count(DISTINCT fullvisitorid) AS unique_visitors \n",
    "FROM   `data-to-insights.ecommerce.all_sessions`; "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next query shows total unique visitors(fullVisitorID) by the referring site (channelGrouping):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "#standardSQL\n",
    "SELECT count(DISTINCT fullvisitorid) AS unique_visitors, \n",
    "       channelgrouping \n",
    "FROM   `data-to-insights.ecommerce.all_sessions` \n",
    "GROUP  BY 2 \n",
    "ORDER  BY 2 DESC;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To find deeper insights in the data, the next query lists the five products with the most views (product_views) from unique visitors. The query counts number of times a product (v2ProductName) was viewed (product_views), puts the list in descending order, and lists the top 5 entries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "#standardSQL\n",
    "SELECT count(*)          AS product_views, \n",
    "       ( v2productname ) AS ProductName \n",
    "FROM   `data-to-insights.ecommerce.all_sessions` \n",
    "WHERE  type = 'PAGE' \n",
    "GROUP  BY v2productname \n",
    "ORDER  BY product_views DESC \n",
    "LIMIT  5;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now expand your previous query to include the total number of distinct products ordered and the total number of total units ordered (productQuantity):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "#standardSQL\n",
    "SELECT count(*)               AS product_views, \n",
    "       count(productquantity) AS orders, \n",
    "       sum(productquantity)   AS quantity_product_ordered, \n",
    "       v2productname \n",
    "FROM   `data-to-insights.ecommerce.all_sessions` \n",
    "WHERE  type = 'PAGE' \n",
    "GROUP  BY v2productname \n",
    "ORDER  BY product_views DESC \n",
    "LIMIT  5; "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lastly, expand the query to include the average amount of product per order (total number of units ordered/total number of orders, or `SUM(productQuantity)/COUNT(productQuantity)`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "#standardSQL\n",
    "SELECT count(*)                                      AS product_views, \n",
    "       count(productquantity)                        AS orders, \n",
    "       sum(productquantity)                          AS quantity_product_ordered, \n",
    "       sum(productquantity) / Count(productquantity) AS avg_per_order, \n",
    "       v2productname                                 AS productName \n",
    "FROM   `data-to-insights.ecommerce.all_sessions` \n",
    "WHERE  type = 'PAGE' \n",
    "GROUP  BY v2productname \n",
    "ORDER  BY product_views DESC \n",
    "LIMIT  5; "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that among these top 5 products by product views that the 22 oz YouTube Bottle Infuser had the highest avg_per_order with 9.38 units per order."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You have completed this lab exercise.   In this situation the \"all_sessions\" was provided to you with the deduplicated records.  In the course of your own future analysis you may have to create this on your own using BigQuery and the `create table DATASET.TABLE2 as select * from DATASET.TABLE1` syntax."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2019 Google Inc.\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
    "http://www.apache.org/licenses/LICENSE-2.0\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
